Speech recognition apparatus

ABSTRACT

Speech recognition is effected by zero crossing analysis of the speech waveform wherein the time interval between consecutive zero crossings is measured, with these intervals and combinations thereof being subsequently identified. Measurement of the intervals is based on a nonlinear timescale, the rate of generation of which is dependent on the fundamental frequency of the speaker. Alterations in the timescale generation are effected due to the initial time constant of the timescale generator circuitry being proportional to and controlled by the variations in the fundamental frequency of the speech waveform.

United States Patent 3,553,372
[72] Inventors: Esmond Philip Goodwin Wright, Bishop's Stortford; Wincenty Bezdel, Harlow, Essex, England
[21] Appl. No.: 587,539
[22] Filed: Oct. 18, 1966
[45] Patented: Jan. 5, 1971
[73] Assignee: International Standard Electric Corporation, New York, N.Y., a corporation of Delaware
[32] Priority: Nov. 5, 1965
[33] Great Britain
[31] 46984/65
[54] SPEECH RECOGNITION APPARATUS
2 Claims, 12 Drawing Figs.

[52] U.S. Cl.: 179/1, 324/77
[51] Int. Cl.: G10l 1/00
[50] Field of Search: 179/1 AS; 340/146.3 (Inquired); 324/77; 328/151
[56] References Cited
UNITED STATES PATENTS
3,102,928 9/1963 Schroeder 179/1(AS)
3,278,685 10/1966 Harper 179/1(AS)
3,335,225 8/1967 Campanella 179/1(AS)
3,416,080 12/1968 Wright et al. 179/1(AS)X
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Charles W. Jirauch
Attorneys: Percy P. Lantzy, C. Cornell Remsen, Jr., Rayson P. Morris, Philip M. Bolton and Isidore Togut

SPEECH RECOGNITION APPARATUS

This invention relates to speech recognition equipment in which automatic adjustment takes place to enable the equipment to suit itself to the speech characteristics of different talkers.

In our copending application Ser. No. 437,349 filed Mar. 2, 1965 for Apparatus for the Analysis of Waveforms, now issued as U.S. Pat. No. 3,416,080, there is described apparatus for speech recognition in which recognition is accomplished by analysis of the zero crossing intervals in the speech wave. Every word has, within fairly wide limits, a recognizable pattern of zero crossings which can be divided into groups representing different sounds, the crossings making up a group being in turn identified by their number and timing relative to each other. Such a method of speech recognition can be distinguished from frequency spectrum analysis inasmuch as the information bearing parameters can be converted into a time or digital domain in the case of zero crossing analysis. The zero crossing intervals making up each group are counted under the control of a suitable nonlinear time scale.

According to the present invention there is provided speech recognition apparatus including means for detecting reversals of polarity in the speech waveform, means for generating a measuring time scale waveform when a reversal is detected, means for counting the number of time scale units generated between the detected reversal and the next detected reversal, and means for altering the scale of the time scale waveform according to a characteristic of the speech waveform.

In a preferred embodiment of the present invention there is provided means for producing a voltage proportional to the fundamental frequency of the speech waveform and means for generating a nonlinear pulse train time scale, the initial time constant of the pulse generator being controlled by and proportional to the voltage derived from the fundamental frequency.

The above and other features of the invention will become more readily apparent and be better understood from the following description of an embodiment thereof, taken in conjunction with the accompanying drawings in which:

FIG. 1 illustrates a typical speech waveform and the timing of the zero crossings contained therein,

FIG. 2 illustrates an alternative method of locating the zero crossings in the waveform,

FIG. 3 is a nonlinear timescale,

FIG. 4 is a block diagram of a circuit arranged to time the intervals between successive zero crossings in a waveform,

FIG. 5 illustrates a method of extracting zero crossings from the waveform,

FIG. 6 is a circuit by which the square wave shown in FIG. 5 may be obtained,

FIG. 7 is a block diagram of a circuit by which a limited number of parts of speech may be recognized,

FIG. 8 is a block diagram of an arrangement by which a larger vocabulary may be recognized,

FIGS. 9 and 10 illustrate sections of FIG. 8,

FIG. 11 illustrates the nonlinear pulse train timescale generating circuit, and

FIG. 12 illustrates diagrammatically two nonlinear pulse time scales derived for different fundamental frequencies.

A fundamental aspect of speech recognition is the ability to extract from a speech waveform features such as frequencies, amplitudes, phase relationships etc., which can be recognized as conforming to certain known patterns for each type of speech sound. These features can be extracted and, with the aid of modern computers, measured, classified, stored and compared with various standards or reference patterns.

One method of analyzing speech waveforms for the purpose of extracting recognizable features therefrom is to count and measure the intervals between zero crossings of the waveform. A refinement of this technique is to count the number of combinations of zero crossing intervals that conform to a particular pattern. For example the speech waveform may be analyzed to ascertain the number of adjacent pairs of zero crossing intervals where the first interval falls within the range between 1 and 1.5 msec and is followed by an interval that falls within the range between 0.5 and 0.7 msec.
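The pair-counting refinement described above can be illustrated in software. The following is only a minimal sketch, assuming a sampled waveform held in a list; the function names `zero_crossing_intervals` and `count_interval_pairs`, and the choice of inclusive range limits, are hypothetical and are not taken from the specification.

```python
def zero_crossing_intervals(samples, sample_rate_hz):
    """Return the intervals (in msec) between consecutive polarity reversals."""
    crossings = []
    for i in range(1, len(samples)):
        # A reversal of polarity occurs between sample i-1 and sample i.
        if (samples[i - 1] < 0) != (samples[i] < 0):
            crossings.append(i)
    return [(b - a) * 1000.0 / sample_rate_hz
            for a, b in zip(crossings, crossings[1:])]


def count_interval_pairs(intervals_ms, first_range, second_range):
    """Count adjacent interval pairs where each interval lies in its range.

    Ranges are (low, high) tuples in msec; inclusive limits are an assumption.
    """
    count = 0
    for first, second in zip(intervals_ms, intervals_ms[1:]):
        if (first_range[0] <= first <= first_range[1]
                and second_range[0] <= second <= second_range[1]):
            count += 1
    return count
```

For the example in the text, `count_interval_pairs(intervals, (1.0, 1.5), (0.5, 0.7))` would give the number of adjacent pairs matching that pattern.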

FIG. 1 illustrates a speech waveform 11 having zero crossings 12 to 20. The intervals between these zero crossings are represented as periods of time 21 to 28. The timing of these intervals is achieved by counting the number of timescale units generated by a timescale which is started when a zero crossing is detected. Thus interval 21 is timed as being 1 timescale unit in duration, while interval 24 is 3 timescale units in duration.

Whilst it has been assumed that the intervals between the actual zero crossings can be timed and counted, in practice it may be found that unwanted noise in the waveform will produce spurious zero crossings. To overcome this it can be arranged that instead of detecting the actual zero crossings, the analysis is based on the detection of those points where the waveform alternately exceeds positive and negative threshold amplitudes. This is illustrated in FIG. 2, in which the waveform 31 is depicted as crossing the positive threshold at points 32, 34, 36, 38 and 40, and crossing the negative threshold at points 33, 35, 37 and 39. This arrangement can be adopted because most of the noise in the waveform is of small amplitude compared with the speech waveform. Therefore the threshold values can be chosen so that the noise content of the waveform lies between them, and detection of the points 32 to 40 will not include spurious zero crossings. It will be noted that the threshold crossings do not depart significantly from the zero crossings, and in practice the intervals between the threshold crossings will be substantially the same as the intervals between the zero crossings.

Therefore, for the remainder of this specification the term zero crossings will be used to denote both actual zero crossings and threshold crossings.
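The alternating threshold detection of FIG. 2 can be sketched as follows. This is an illustrative sketch only, assuming symmetric positive and negative thresholds and a sampled waveform; the function name `threshold_crossings` is hypothetical.

```python
def threshold_crossings(samples, threshold):
    """Return sample indices where the waveform alternately exceeds
    +threshold and falls below -threshold.

    Excursions that stay between the two thresholds (assumed to be noise)
    never produce a detection.
    """
    points = []
    last = 0  # +1 after a positive detection, -1 after a negative one
    for i, x in enumerate(samples):
        if x > threshold and last != 1:
            points.append(i)
            last = 1
        elif x < -threshold and last != -1:
            points.append(i)
            last = -1
    return points
```

Because a new detection is accepted only after the opposite threshold has been crossed, small-amplitude noise around zero does not generate spurious crossings.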

It has been stated above that the intervals between zero crossings are timed by counting timescale units, the timescale being started afresh in each case when a zero crossing is detected.

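The relation between the measured interval $Z_i$, the counting period $t_c$, and the count number $n$ is:

$$n\,t_c \le Z_i < (n+1)\,t_c$$

It should be noted that $Z_i = 1/(2f)$, where $f$ is the frequency of the zero crossing wave.

Considering the lower and upper end frequencies of this wave, namely $f_l$ and $f_u$, then

$$f_l = \frac{f_c}{2(n+1)}, \qquad f_u = \frac{f_c}{2n}$$

where $f_c = 1/t_c$ is the counting rate, or pulse repetition frequency in the case of a pulse timescale.

Thus

$$f_0 = \frac{f_c}{4}\cdot\frac{2n+1}{n(n+1)}$$

where $f_0$ is the center frequency, and

$$B = f_u - f_l = \frac{f_c}{2n(n+1)} \quad \text{(bandwidth).}$$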

In the previous discussion, it was assumed that the counting rate was constant during the measured interval or channel. The principal disadvantage of this technique is that the accuracy of measurement depends directly upon the frequency of the signal to be measured. It can be seen that a low frequency or long interval will be measured very accurately compared with the measurement of a high frequency or short interval.

In terms of frequency bands, each count number at the lower end of the measured spectrum will produce a bandwidth which is too narrow, and each count number at the higher end will produce a bandwidth which is too wide. For example, consider that the counting rate is 10 kc./s. The interval between two successive counts is equivalent to 5 kc./s. However, substitution of n in the preceding formulas shows that where n is equal to 1, the band is equivalent to 2,500 to 5,000 c./s. Similarly it is possible to show that for n = 15 the frequency band is 300 to 330 c./s.
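The dependence of bandwidth on count number can be checked numerically. The sketch below assumes that a count of n corresponds to an interval between n and n+1 counting periods, and that an interval is half a period of the zero crossing wave, so that the band edges are the counting rate divided by 2(n+1) and by 2n. That reading reproduces the 2,500 to 5,000 c./s. band quoted for n = 1; the function name `band_for_count` is hypothetical.

```python
def band_for_count(n, counting_rate_hz):
    """Return (lower, upper, center, bandwidth) in c./s. for count number n.

    Assumes a count of n means n*t_c <= interval < (n+1)*t_c and that the
    interval is half a period of the zero crossing wave.
    """
    lower = counting_rate_hz / (2 * (n + 1))
    upper = counting_rate_hz / (2 * n)
    return lower, upper, (lower + upper) / 2, upper - lower


if __name__ == "__main__":
    for n in (1, 2, 5, 15):
        lower, upper, center, width = band_for_count(n, 10_000.0)
        print(f"n={n:2d}: {lower:7.1f} to {upper:7.1f} c./s., width {width:7.1f}")
```

Under these assumptions the band for n = 15 comes out at roughly 312 to 333 c./s., close to the rounded figures quoted above, and the band narrows rapidly as n grows, which is the disadvantage the text describes.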

In any practical application of this counting technique, it is most desirable to increase the number of counts for a high frequency, i.e. reduce the width of the band, and to decrease the number of counts for a lower frequency, i.e. increase the width of the band. A possible method of achieving this object is to use a nonlinear measuring scale so that the counting rate is effectively different in adjacent channels.

The formulas which were derived previously for counting frequency, count number, etc., still apply. However, instead of using a constant counting rate $f_c$, one has to substitute a function relating $f_c$ to either time or count number.

This function has the form $f_c(n) = f_1\,(1 + \log f(n))$, where $f_1$ is the frequency of the first pulse.

FIG. 3 depicts a nonlinear timescale such as is used in FIGS. 1 and 2.
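Counting against a nonlinear timescale can be sketched in a few lines. The exact law of the timescale is not assumed here; purely for illustration, each pulse interval is taken to be a fixed multiple of the previous one. The names `nonlinear_pulse_times` and `count_units`, and the geometric growth itself, are hypothetical.

```python
from bisect import bisect_right


def nonlinear_pulse_times(first_interval_ms, growth, n_pulses):
    """Pulse times (msec after the zero crossing) of an illustrative timescale.

    Each interval is `growth` times the previous one; this geometric law is
    an assumption made for the sketch, not the law used in the apparatus.
    """
    times, t, step = [], 0.0, first_interval_ms
    for _ in range(n_pulses):
        t += step
        times.append(t)
        step *= growth
    return times


def count_units(interval_ms, pulse_times_ms):
    """Number of timescale pulses generated within the measured interval."""
    return bisect_right(pulse_times_ms, interval_ms)
```

With a growth factor above 1, short intervals are resolved by closely spaced early pulses while long intervals fall into progressively wider channels.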

FIG. 4 illustrates by block diagrams a circuit for timing the intervals between successive zero crossings in a waveform such as that shown in either FIG. 1 or FIG. 2.

The equipments denoted by the various blocks in the drawings are known electronic circuits and do not in themselves constitute novel features of the invention.

The incoming speech waveform 50 is fed to a wave-shaping circuit 51 used to identify the zero crossings. The identification may be performed according to the procedures outlined with reference to FIG. 2. The output from the wave-shaping circuit may take the form of a square wave, as shown in FIG. 5. It will be seen that the waveform 61 in FIG. 5 can be used to produce a square wave 62 having the same zero crossing characteristics as the waveform 61. Since zero crossing analysis is independent of amplitude or other factors, a square wave of fixed amplitude having the necessary zero crossing intervals makes a suitable trigger waveform for operating counters and other circuits.

One method of producing the desired square wave is by utilizing the circuit shown in FIG. 6. In this FIG., transistor 70 operates as an amplifier for the speech input, which is limited by amplitude limiter diodes 68 and 69 so as to avoid overloading of the amplifier. Transistor 71 operates as a phase-splitter and converts the amplified and limited signal from transistor 70 into two outputs in opposite phase. These outputs are passed to two transistors 72 and 73 operating as emitter followers and arranged to reproduce negative going signals only. The waveform 63 of FIG. 5 represents the outputs of transistors 72 and 73 added together. These two outputs are taken to the inputs of a pair of trigger transistors 74 and 75. The trigger can be set to a threshold value which is adjustable by means of a potentiometer 76 in the common emitter connection of the two transistors. The outputs from the circuit are derived from two inverter transistors 77 and 78, and are represented by the square wave 62 in FIG. 5.

The circuit of FIG. 6 is biased where shown by voltages V+ or V-, all of equal amplitude with respect to ground.

Returning to FIG. 4, the output of the wave-shaping circuit is applied to a measuring circuit 55 which includes separate timescale counting circuits 52 and 53, and a timescale generating circuit 54.

As has been previously stated the timescale generated is nonlinear, and recommences when each zero crossing is detected. The counter 52 is arranged to count the timescale units following all positive going zero crossings, and the counter 53 is arranged to count the timescale units following all negative going zero crossings.

Switches 56 and 57 can be set to select the counts of either counter 52 or 53, and the selected count is passed through a gate 58 which is under the control of a threshold and control circuit 59. This threshold and control circuit is used to control the time during which an examination of zero crossings is made. The results of each examination are displayed in a display counter 60, which registers the total number of zero crossings which occur during examination time.

The equipment depicted in FIG. 4 can be arranged to make various types of examination of the speech waveform 50, for example:

I. It can count the number of zero crossing intervals that fall into the time range between 1 msec and 1.5 msec.

II. It can count the number of combinations of intervals, such as those combinations where an interval of between 1 msec and 1.5 msec is followed by an interval of between 0.5 msec and 0.7 msec.

The recognition of simple parts of speech (not in the grammatical sense), such as digits zero to nine, as opposed to simple waveform analysis, can be achieved by an arrangement such as that shown in FIG. 7. It consists of a squaring circuit 80 which identifies the zero crossing intervals, a measuring circuit 81 which measures the zero crossing intervals, and a gating circuit 82 which sorts the zero crossing intervals into seven interval ranges, referred to as channels CH, as follows:

CH1: ∞ to 1.31 msec
CH2: 1.31 to 0.93 msec
CH3: 0.93 to 0.73 msec
CH4: 0.73 to 0.42 msec
CH5: 0.42 to 0.31 msec
CH6: 0.31 to 0.18 msec
CH7: 0.18 to 0 msec
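The sorting performed by the gating circuit 82 can be sketched as a lookup against the channel limits. This is a sketch only: it reads the first channel as covering all intervals of 1.31 msec and above, and it assigns an interval lying exactly on a limit to the longer-interval channel, which is an assumption. The names `channel_for_interval` and `channel_counts` are hypothetical.

```python
# Lower limits (msec) of channels CH1 to CH7, longest intervals first.
CHANNEL_LOWER_LIMITS_MS = [1.31, 0.93, 0.73, 0.42, 0.31, 0.18, 0.0]


def channel_for_interval(interval_ms):
    """Return the channel number (1 to 7) for a zero crossing interval."""
    for channel, lower in enumerate(CHANNEL_LOWER_LIMITS_MS, start=1):
        if interval_ms >= lower:
            return channel
    raise ValueError("interval must be non-negative")


def channel_counts(intervals_ms):
    """Return a list of seven counts, one per channel CH1 to CH7."""
    counts = [0] * len(CHANNEL_LOWER_LIMITS_MS)
    for interval in intervals_ms:
        counts[channel_for_interval(interval) - 1] += 1
    return counts
```

The per-channel counts returned by `channel_counts` correspond to the quantities that counters preset to a threshold could then act upon.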

A threshold circuit 83 provides on or off signals during the presence or absence of speech signals, and controls a timing circuit 84 which provides the following outputs:

(i) Output when speech signals persist more than 100 msec. (beginning of the word)
(ii) Output when speech signal is absent for more than 200 msec. (end of word)
(iii) Output (D1) for the first 100 msec. of the word
(iv) Output (D2) for the 350 msec. following first 100 msec. of speech signal
(v) Output (D3) for the first 100 msec. after a gap shorter than 200 msec.

A group of threshold counters 85 are set to count the number of zero crossing intervals in a given channel. Each threshold counter produces an output when a threshold to which the counter is preset is reached. The following threshold counters (TC) are provided:

TC1 for CH1
TC2 for CH1 + CH2
TC3 for CH3 + CH4
TC4 for CH5
TC5 for CH6 + CH7

Finally a gating circuit 86 is used to identify spoken digits according to the following patterns:

GATE CONDITION [table not recoverable]

[...] example, the unit marked 88 classifies the voiced or unvoiced characteristics. Units 89 and 90 isolate the first and second frequency ranges corresponding to formants of vowel sounds respectively and pass the vowel information in the form of zero crossings. Unit 91 extracts the fundamental frequency of a talker. Units marked 92 and 93 extract two groups of frequencies with respect to unvoiced sounds, and unit 94 detects consonant groups. The unit 95 is a threshold detector [...]

The complexity of the first stage in the classification of speech characteristics depends mainly on the size of vocabulary and the range of talkers. For example, for the recognition of vowels it may be sufficient to analyze only one frequency range.

In the second stage of the recognition process analysis is performed on the portions of speech which were separated in the first stage. This analysis leads to the recognition of specific voiced and unvoiced sounds by the recognition circuits 97 and 98. The analysis is performed during the time controlled by a sample A which covers a segment of sound. The same analysis is repeated for any subsequent segment of the speech wave. The length of each segment, e.g. sample A, is determined by the fundamental frequency of the talker. This is the function of the measuring and segmentation unit 99.

FIG. 9 shows in more detail a part of a vowel recognition arrangement. Information is derived from the zero crossings of the first formant and the analysis is done by measuring zero crossing distances and extracting only the significant ones. The zero crossing intervals are measured in the unit 102, and the timing control 103, controlled by sample pulse A, selects the period during which the zero crossing distances are measured. The significant zero crossing distances extracted by the unit 102 are stored in the storage units marked D1, D2 ... Dn. As has been stated above, the length of each sample of speech is determined by the fundamental frequency of the talker. The fundamental frequency also controls measurement of zero crossing distances. One sample constitutes the shortest recognizable portion of a sound. In the case of vowels these portions may be referred to as "little vowels." For example, during an uttering of the sound a, recognition of a segment of the sound can consist of the following series of samples: [...] This series is stored as three a's and two o's. The recognition of each sample is performed by the recognition circuit 104 under the control of the sample pulse A, and when a sufficient number of samples have been recognized a complete group of samples, i.e. a segment, is recognized by the recognition circuit 105 under the control of a segment pulse B. The recognition of the group of samples given above, under the control of the segment pulse B, indicates that the unknown letter sound was a. The segment B covers a number of samples A which is sufficient to make a decision on the unknown sound.
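The step from recognized samples to a recognized segment can be sketched as a simple tally. The decision rule used here, that the most frequent sample label wins provided it reaches a minimum count, is an assumption for illustration; the specification does not state the rule applied by recognition circuit 105. The name `recognize_segment` and the sample ordering in the usage line are hypothetical.

```python
from collections import Counter


def recognize_segment(sample_labels, min_count):
    """Return the most frequent sample label if it occurs at least
    `min_count` times within the segment, otherwise None.

    The majority rule is an assumed stand-in for the segment decision.
    """
    if not sample_labels:
        return None
    label, count = Counter(sample_labels).most_common(1)[0]
    return label if count >= min_count else None
```

For a segment stored as three a's and two o's, for instance `recognize_segment(["a", "a", "o", "a", "o"], 3)`, this rule returns `"a"`.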

Recognition of a group of parameters, such as zero crossing distances or little vowels, and so on, can be accomplished by a straightforward threshold circuit followed by logical gating or by a statistical decision circuit.

An example of the latter is shown schematically in FIG. 10. The output from each parameter (a parameter can be represented as either 1 or 0 voltage levels, or as an analogue or quantized voltage level) is taken via resistor Ri to a point recognizing, for example, a, o etc. The value of the resistor Ri represents a weighted contribution of a given parameter to the recognition of a, o etc., and is such that R0/Ri [...] 1, where R0 is a constant of the adding circuit. Contributions of Ri should satisfy the expression [...] for all i's associated with a given point, say, a, o etc.
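A resistor adding network of this kind behaves like a weighted sum. The sketch below assumes that each parameter contributes in proportion to R0/Ri and that the recognition point with the largest sum is selected; both the selection rule and the names `weighted_score` and `decide` are assumptions made for illustration, as are the resistor values in the usage example.

```python
def weighted_score(parameters, resistors_ohm, r0_ohm):
    """Sum of parameter levels, each weighted by R0/Ri (assumed weighting)."""
    return sum(p * r0_ohm / r for p, r in zip(parameters, resistors_ohm))


def decide(parameters, resistor_sets, r0_ohm):
    """Return (best_label, scores) over recognition points such as 'a', 'o'.

    `resistor_sets` maps each label to its list of Ri values in ohms.
    Choosing the largest score is an assumed decision rule.
    """
    scores = {label: weighted_score(parameters, rs, r0_ohm)
              for label, rs in resistor_sets.items()}
    return max(scores, key=scores.get), scores
```

For example, with parameters `[1, 0, 1]`, `R0 = 1000` and hypothetical networks `{"a": [2000, 4000, 4000], "o": [8000, 2000, 8000]}`, the point for a accumulates the larger weighted sum.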

Similarly the unvoiced sounds are recognized by the recognition circuit 98.

As in the first stage, complexity of the remaining stages in the recognition process is mainly related to the size of vocabulary and the range of talkers. For example, voiced, unvoiced and phoneme recognition can be reduced to one unit. The phoneme recognition circuit 100 and the word recognition circuit 101 are arranged on the same lines as previously described with reference to FIGS. 9 and 10. The main difference is that in each succeeding recognition sequence another set of parameters is brought into use from the preceding stage. The number of stages in the recognition process is also related to the size of vocabulary and the range of talkers. In the recognition of a short selected vocabulary it may be quite feasible to recognize words directly, without dividing them into phonemes, voiced sounds, etc.

In the arrangement shown in FIG. 11 two complementary transistors 201 and 202 have their emitters connected together. The base of transistor 202 is connected to the collector of transistor 201 by a positive feedback connection 203.

The base of transistor 201 is connected to a bias voltage source at b via two resistors 210, 211 and is also connected to two grounded capacitors 212 and 213. Transistors 201 and [...] Positive and negative DC bias supplies are connected as indicated to the collector and base of transistor 202 and the collector of transistor 201.

When the base of transistor 201 is driven negative sufficiently for it to begin to conduct then the action of the feedback circuit 203 will start to drive the base of transistor 202 positive. Transistor 202 then begins to conduct and its emitter-collector current reinforces the emitter-collector current of transistor 201, and the rise in emitter voltage of transistor 201 makes it conduct even more. This process continues until saturation is reached and the feedback voltage applied to the base of transistor 202 cannot rise any further.

The capacitors 212, 213 and resistors 210 and 211 control the voltage applied to the base of transistor 201 in response to a pulse at the input 204.

Initially a bias voltage b at point 208 is arranged to be at least equal to or more positive than the voltage a at point 209. The timing scale is initiated at time t by a negative going pulse at the input 204, applied to capacitor 212 by transistor 206. The amplitude of this pulse determines the duration T (note FIG. 12) of a succession of pulses in a timescale. This negative going pulse at 204 negatively charges capacitor 212 according to its amplitude. Capacitor 212 immediately starts to discharge according to the time constant of 210 and 212. At the same time capacitor 213 is charged negatively, via resistor 211, at a rate determined by the time constant of 213 and 211. When the voltage on 213 drops to a point where it is equal to the voltage a at point 209 the base voltage of transistor 201 is sufficiently negative to cause the transistor to conduct. The positive feedback circuit 203 ensures that the rise in conduction of transistors 201 and 202 is very rapid and causes the first timing pulse to be delivered to the output 205. When transistor 201 is saturated the drain on capacitor 213 via the base of transistor [...] Meanwhile capacitor 212 has lost some of its negative charge due to the potential b at point 208 and therefore the rate of negative charge of capacitor 213 is reduced. Thus the second pulse interval is longer than the first, and each succeeding interval is longer than the last. FIG. 12 illustrates a timescale P generated by the circuit of FIG. 11.

The negative-going pulses at point 204 are derived from the trigger output of the circuit of FIG. 6. This circuit will produce two square wave output waveforms which have positive-going trigger pulses, each trigger pulse in the one square wave output being representative of a positive-going zero crossing contained in the input speech wave and each trigger pulse in the other square wave output being representative of a negative-going zero crossing contained in the input speech wave. Each trigger output is conventionally inverted to give a negative-going pulse, the leading edge of which coincides with the positive-going edge of the relevant trigger output. These two sets of negative-going pulses have a constant width and amplitude to define the period T referred to above.

If the circuit is left untouched after the initial pulse at point 204 there will come a time when the output pulse interval becomes infinite. However, in practice the period T over which the timescale is required to function covers only a small number of pulses, and at the end of this period the timescale will be restarted by receipt of a new negative going pulse at point 204. To ensure that the timescale starts from zero, so to speak, at the start time t capacitor 213 is fully discharged positively by a positive going pulse applied via the diode 207.

The value of the potential b at point 208 in relation to the potential at point 209 controls the number and distribution of output pulses during a given period T. To alter the scale, i.e. to increase or reduce T for the same number of pulses with the same pulse interval ratios, it is only necessary to alter the initial negative charge on the capacitor 212. The timescale q in FIG. 12 illustrates the effect of reducing the amplitude of the input pulse at point 204.
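The effect of altering the scale while keeping the pulse interval ratios can be illustrated numerically. The sketch below does not model the circuit of FIG. 11; it simply assumes that each interval is a fixed multiple of the previous one and stretches or compresses the whole train to fill a period T. The function name `pulse_times` and the geometric law are hypothetical.

```python
def pulse_times(period_ms, n_pulses, ratio):
    """Times (msec) of n_pulses pulses spanning period_ms.

    Each interval is `ratio` times the previous one (an assumed law), so
    changing period_ms rescales the train without changing the ratios.
    """
    weights = [ratio ** k for k in range(n_pulses)]
    first_interval = period_ms / sum(weights)
    times, t = [], 0.0
    for w in weights:
        t += first_interval * w
        times.append(t)
    return times


if __name__ == "__main__":
    p = pulse_times(7.5, 4, 2.0)    # longer period T
    q = pulse_times(3.75, 4, 2.0)   # shorter period, same interval ratios
    print(p)
    print(q)
```

Halving the period halves every pulse time, so the two trains have the same number of pulses and the same interval ratios, as described for the two timescales of FIG. 12.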

As noted previously, reference is made to the use of a nonlinear timescale for counting zero crossing intervals. In the present invention the circuit of FIG. 11 is used to generate a nonlinear timescale the scale of which is automatically expanded or contracted according to the fundamental frequency or other characteristics of the talker. The derivation of a signal representing the fundamental frequency of a talker is well known and forms no part of the present invention; see for example "Automatic Extraction of the Excitation Function of Speech with Particular Reference to the Use of Correlation Methods" by J. S. Gill, Proceedings of the 3rd International Congress on Acoustics, Stuttgart 1959, page 217. The pitch analogue output of the [...] described therein can be converted by means [...] to provide a controlling voltage waveform for the input 204 in the nonlinear timescale generator of FIG. 11, the amplitude of this voltage being related to the fundamental frequency or other characteristic of the talker.

It is to be understood that the foregoing description of specific examples of this invention is made by way of example only and is not to be considered as a limitation on its scope.

1. Speech recognition apparatus comprising:

a. a speech waveform source;

b. means coupled to said speech waveform source for detecting waveform reversals of polarity and generating therefrom a corresponding output waveform;

c. means coupled to said speech waveform source for detecting the waveform fundamental frequency and generating therefrom a voltage representative of said fundamental frequency;

d. means, responsive to said reversal detector and to said fundamental frequency detector, for generating a nonlinear measuring timescale waveform, said nonlinear timescale being initiated whenever a reversal is detected and altered according to variations in the fundamental frequency; and

e. means, coupled to said reversal detector and to said timescale generator, for counting the number of timescale units generated between detected reversals.

2. Apparatus according to claim 1 in which the means for generating the nonlinear time scale includes first and second transistors having complementary symmetry with their emitters connected together, a positive feedback connection between the base of the first transistor and the collector of the second transistor, first and second capacitors connected to the base of the second transistor and means for charging the first and second capacitors at differential rates by the voltage related to the fundamental frequency.
